FINAL PROJECT

By Camden Masters and Shreelekha Revankar

Finding the best wine for your budget Walking into the liquor store, the wine section can be a daunting experience. As with any shopping experience you want to know that you are getting the best thing for your budget. If you like what you got, you probably would want to try something similar as well. There are many varieties of wine and the prices range from a few dollars to a few thousands. What if there was a way to make wine purchasing more efficient? This tutorial is broken down into two parts. The first section handles finding the best wine for a specific price range for the variety of wine of your choosing. The second section handles finding the most similar wine to any given wine based on variety, points, and price. This dataset consists of 130 thousand listings of wines from around the world and are rated by professional wine tasters. The prices range from $4 t0 $3300 per bottle.

We believe that this is a good example of how data science can be applied to optimize everyday tasks. Not only saving us money but also giving our pallets something to enjoy. The research done in this tutorial can also be applied to far broader topics and things beyond just wine. The second half of our tutorial is used to find the most similar item to any given search items. Combined this tutorial gives us the tools that are useful for a multitude of online retailers and subscription-based services.

In [47]:
import pandas as pd
import matplotlib.pylab as plt
import matplotlib.axes as ax
import seaborn as sns
import numpy as np
In [48]:
wine_table = pd.read_csv('/winemag-data_first150k.csv')
#Provide the ability to show the entirety of the data when not looking at the head
pd.set_option("max_columns", None)
pd.set_option("max_rows", None)
wine_table.head()
Out[48]:
Unnamed: 0 country description designation points price province region_1 region_2 variety winery
0 0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz
1 1 Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Tinta de Toro Bodega Carmen Rodríguez
2 2 US Mac Watson honors the memory of a wine once ma... Special Selected Late Harvest 96 90.0 California Knights Valley Sonoma Sauvignon Blanc Macauley
3 3 US This spent 20 months in 30% new French oak, an... Reserve 96 65.0 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi
4 4 France This is the top wine from La Bégude, named aft... La Brûlade 95 66.0 Provence Bandol NaN Provence red blend Domaine de la Bégude

Creating numerical points for non-numerical data

In [54]:
# Finding unique varieties of wine in data 

print('Variety:')

varieties = list(set(wine_table['variety'].to_list()))
wine_table['variety'].sort_values().unique()


# Displaying a sample of varieties
print(varieties[0:50])
Variety:
['Valdiguié', 'Chardonnay-Albariño', 'Yapincak', 'Malbec', 'Rhône-style Red Blend', 'Mazuelo', 'Alicante Bouschet', 'Kekfrankos', 'Tinta Negra Mole', 'Muscat', 'Shiraz-Cabernet Sauvignon', 'Plavac Mali', 'Pinot Blanc-Chardonnay', 'Freisa', 'Sarba', 'Vitovska', 'Merlot-Grenache', 'Malbec-Bonarda', 'Portuguese Rosé', 'Sangiovese', 'Sauvignon Blanc-Semillon', 'Siria', 'Irsai Oliver', 'Mavrud', 'Gamay', 'Muscatel', 'Aleatico', 'Roscetto', 'Duras', 'Lambrusco di Sorbara', 'Terret Blanc', 'Rosé', 'Riesling', 'Malagouzia', 'Mencía', 'Verduzzo', 'Morio Muskat', 'Jaen', 'Syrah-Petite Sirah', 'Viura-Sauvignon Blanc', 'Hárslevelü', 'Sangiovese-Syrah', 'Tai', 'Romorantin', 'Refosco', 'País', 'Merlot', 'Pugnitello', 'Sacy', "Muscat d'Alexandrie"]
In [55]:
# We will be using the index number from this set to act as the unique number identifier for each variety

print(varieties.index('Malbec'))
3

Creating a 3-d point to represent a specific wine

Here in the data frame below, one can see two Malbec wines with very similar points and prices, our goal is to be able to find the most similar wine to our search wine as possible.

First we need to create an (x,y,z) coordinate to represent the (variety,price,points)

In [56]:
# Given a specific wine, we want to be able to predict the wine closest or most similar to the specified wine based on variety, price and score

is_Malbec =  wine_table['variety']=='Malbec'
malbec_table = wine_table[is_Malbec]

# Checking to see that filter worked
print('Malbec:')
print(malbec_table['variety'].sort_values().unique())
malbec_table.head()
Malbec:
['Malbec']
Out[56]:
Unnamed: 0 country description designation points price province region_1 region_2 variety winery custom_price_range
18 18 France Coming from a seven-acre vineyard named after ... Le Pigeonnier 95 290.0 Southwest France Cahors NaN Malbec Château Lagrézette >500
60 60 Argentina Concentrated, ripe blackberry and cassis aroma... The Apple Doesn't Fall Far From The Tree 91 30.0 Mendoza Province Mendoza NaN Malbec Matias Riccitelli 20-50
62 62 Argentina Smoky aromas of fresh-cut wood blend with berr... Alegoría Gran Reserva 91 25.0 Mendoza Province Mendoza NaN Malbec Navarro Correas 20-50
115 115 Argentina Aromas of prune, raisin and black plum are ful... Reserva 86 15.0 Mendoza Province Valle de Uco NaN Malbec Viñalba 0-20
190 190 Argentina Dark-berry aromas are crisp and show a note of... Reserve 87 15.0 Mendoza Province Tupungato NaN Malbec Tupun 0-20
In [57]:
# Creating points from wine rows to find euclidean distance
# Taking the example of an under 20 dollar bottle of Malbec (This corresponds to row 115 in the data set)
wine = wine_table.loc[115]
print(wine)
print("===================================")
print(wine['variety'])
print(varieties.index(wine['variety']))
print(int(wine['price']))
print(wine['points'])
print("===================================")

variety_id = varieties.index(wine['variety'])

# intializing point for search wine
search  = np.array(((varieties.index(wine['variety'])), (int(wine['price'])), (wine['points']))) 

print("search element:")
print(search)
Unnamed: 0                                                          115
country                                                       Argentina
description           Aromas of prune, raisin and black plum are ful...
designation                                                     Reserva
points                                                               86
price                                                                15
province                                               Mendoza Province
region_1                                                   Valle de Uco
region_2                                                            NaN
variety                                                          Malbec
winery                                                          Viñalba
custom_price_range                                                 0-20
Name: 115, dtype: object
===================================
Malbec
3
15
86
===================================
search element:
[ 3 15 86]

Calculating Similarity

Here we use the Euclidean distance between our search wine and any other wine to see which one is the most similar. As you can see, from the two wines below, a wine of the same variety and similar price and points will have a much smaller "distance" from our search wine than that of a wine of different variety, and less similar price and points

A more basic example using simple integers can be found at: https://www.w3resource.com/python-exercises/math/python-math-exercise-79.php

In [58]:
# Calculating Euclidean distance for two random wines

elem1 = wine_table.loc[86]
print("first random wine: ")
print(elem1)
element1 = np.array(((varieties.index(elem1['variety'])), (int(elem1['price'])), (elem1['points'])))
print("")
print("first random wine point: ")
print(element1)
print("====================================================================================")


elem2 = wine_table.loc[190]
print("second random wine: ")
print(elem2)
element2 = np.array(((varieties.index(elem2['variety'])), (int(elem2['price'])), (elem2['points'])))
print("")
print("second random wine point: ")
print(element2)
print("====================================================================================")
print("")

print("search element:")
print(search)
print("====================================================================================")

# printing Euclidean distance 
print("")
print("Difference between first random wine and our wine")
dist1 = np.linalg.norm(search - element1) 
print(dist1)
print("")
print("Difference between second random wine and our wine")
dist2 = np.linalg.norm(search - element2) 
print(dist2)
first random wine: 
Unnamed: 0                                                           86
country                                                          France
description           This delicious, refreshing wine is textured, t...
designation                                       Montmains Premier Cru
points                                                               91
price                                                                45
province                                                       Burgundy
region_1                                                        Chablis
region_2                                                            NaN
variety                                                      Chardonnay
winery                                         Domaine Gérard Duplessis
custom_price_range                                                20-50
Name: 86, dtype: object

first random wine point: 
[391  45  91]
====================================================================================
second random wine: 
Unnamed: 0                                                          190
country                                                       Argentina
description           Dark-berry aromas are crisp and show a note of...
designation                                                     Reserve
points                                                               87
price                                                                15
province                                               Mendoza Province
region_1                                                      Tupungato
region_2                                                            NaN
variety                                                          Malbec
winery                                                            Tupun
custom_price_range                                                 0-20
Name: 190, dtype: object

second random wine point: 
[ 3 15 87]
====================================================================================

search element:
[ 3 15 86]
====================================================================================

Difference between first random wine and our wine
389.1901848710987

Difference between second random wine and our wine
1.0

Here we see the difference between wines, using this we can find the wine with the smallest difference from our search wine, not including itself

In [59]:
# looping through data-frame to find most similar wine and recording the wine with the smallest difference

print("Search Element:")
print("")
print(wine['variety'])
print(varieties.index(wine['variety']))
print(int(wine['price']))
print(wine['points'])
print("")
print(search)
print("")
print("====================================================================================")



print("Total number of Wines:")
print(len(wine_table.index))

print("")
# For the sake of saving calculation time and space I will be running the distance algorithm on a filtered set of wines
# This set is the set of wines of the same variety
print("Total number of Malbec wines:")
print(len(malbec_table.index))
Search Element:

Malbec
3
15
86

[ 3 15 86]

====================================================================================
Total number of Wines:
150930

Total number of Malbec wines:
3208
In [60]:
# Here we could have used the entire table but for the sake of the tutorial and calculation time we will use the filered table
def findSimilar():
    min_row = 0
    min_dist = 100.0
    
    for x in malbec_table.dropna().index:
        elem = wine_table.loc[x]
        point = np.array(((varieties.index(elem['variety'])), (int(elem['price'])), (elem['points'])))
        dist = np.linalg.norm(search - point) 
        if dist < min_dist and dist > 0:
            min_dist = dist
            min_row = x
            
    return min_row
##
similar = findSimilar()

print("Wine most similar to our input row is:")
most_sim = wine_table.loc[similar]
print(wine_table.loc[similar])
Wine most similar to our input row is:
Unnamed: 0                                                         8500
country                                                              US
description           The generous fruit flavors in this medium-bodi...
designation                                     The Heritage Collection
points                                                               86
price                                                                14
province                                                     California
region_1                                                           Lodi
region_2                                                 Central Valley
variety                                                          Malbec
winery                                                          Peirano
custom_price_range                                                 0-20
Name: 8500, dtype: object
In [61]:
print("Search Wine:")
print("")
print(wine['variety'])
print(int(wine['price']))
print(wine['points'])
print("")
print(wine)
print("______________________________")
print("Most Similar Wine:")
print("")
print(most_sim['variety'])
print(int(most_sim['price']))
print(most_sim['points'])
print("")
print(most_sim)
print("")
Search Wine:

Malbec
15
86

Unnamed: 0                                                          115
country                                                       Argentina
description           Aromas of prune, raisin and black plum are ful...
designation                                                     Reserva
points                                                               86
price                                                                15
province                                               Mendoza Province
region_1                                                   Valle de Uco
region_2                                                            NaN
variety                                                          Malbec
winery                                                          Viñalba
custom_price_range                                                 0-20
Name: 115, dtype: object
______________________________
Most Similar Wine:

Malbec
14
86

Unnamed: 0                                                         8500
country                                                              US
description           The generous fruit flavors in this medium-bodi...
designation                                     The Heritage Collection
points                                                               86
price                                                                14
province                                                     California
region_1                                                           Lodi
region_2                                                 Central Valley
variety                                                          Malbec
winery                                                          Peirano
custom_price_range                                                 0-20
Name: 8500, dtype: object

In [52]:
#Create custom cut intervals for easier use by a customer
cut_labels = ['0-20', '20-50', '50-100', '100-200','>500']
cut_bins = [0, 20, 50, 100, 200,500]
wine_table['custom_price_range'] = pd.cut(wine_table['price'].to_list(), bins=cut_bins, labels=cut_labels)

#Lists of intervals and varieties that can be looped without duplicates
intervals = wine_table['custom_price_range']
intervals = list(dict.fromkeys(intervals))
variety = wine_table['variety']
variety = list(dict.fromkeys(variety))

for curr in intervals:
    #Query of a mini data table where all custom price ranges match the curr in the interval loop
    qry = wine_table[wine_table['custom_price_range'] == curr]

    #Loop through varieties within this interval
    varieties = qry['variety']
    varieties = list(dict.fromkeys(varieties))
    for var in varieties:
        #For each variety in this custom price range, gather the price and points
        qry2 = qry[qry['variety'] == var]
        pay = qry2['price'].to_list()
        points = qry2['points'].to_list()
        
        pay_sum = 0
        points_sum = 0
        
        for p in pay:
            pay_sum += p
        
        for p in points:
            points_sum += p
            
        x = []
        #Take the average of the price and points for this variety at this price range
        x.append(pay_sum/len(pay))
        y = []
        y.append(points_sum/len(points))

        plt.title('Varieties\' Average Points per Price in the Price Range: ' + curr)
        plt.xlabel('Average Price')
        plt.ylabel('Average Points')
        plt.scatter(x, y)

    plt.figure(figsize=(10, 8))
    plt.show() 
<Figure size 720x576 with 0 Axes>
<Figure size 720x576 with 0 Axes>
<Figure size 720x576 with 0 Axes>
<Figure size 720x576 with 0 Axes>
<Figure size 720x576 with 0 Axes>
<Figure size 720x576 with 0 Axes>
In [ ]:
#violin
points = wine_table['points'].to_list()
wine_table['points_range'] = pd.cut(points, 10)

varieties = wine_table['variety']
varieties = list(dict.fromkeys(varieties))
 
#Create graphs for each variety and their target price for points
for var in varieties:
  qry = wine_table[wine_table['variety'] == var]
  plt.figure(figsize=(10, 8))
  sns.violinplot(x=qry['points_range'], y=qry['price'], data=qry).set_title(var)
  plt.xticks(rotation=45)
  plt.show()
In [44]:
#Find the best wine for $15
target = 15.0
range = ''
if 0.0 <= target < 20.0:
  range = '0-20'
elif 20.0 <= target < 50.0:
  range = '20-50'
elif 50.0 <= target < 100.0:
  range = '50-100'
elif 100.0 <= target < 200.0:
  range = '100-200'
elif 200.0 <= target < 500.0:
  range = '200-500'
else:
  range = '>500'

points = 0

qry = wine_table[wine_table['custom_price_range'] == range]
for index, row in qry.iterrows():
  if row['points'] > points:
    points = row['points']
    best_value = row

print(best_value)
Unnamed: 0                                                        56971
country                                                              US
description           Superb fruit highlights this tight, sculpted S...
designation                                                         NaN
points                                                               96
price                                                                20
province                                                     Washington
region_1                                           Columbia Valley (WA)
region_2                                                Columbia Valley
variety                                                           Syrah
winery                                                             Rulo
custom_price_range                                                 0-20
Name: 56971, dtype: object
In [ ]:
#Show our updated data table with added columns
wine_table.head()
Out[ ]:
Unnamed: 0 country description designation points price province region_1 region_2 variety winery price_range points_range custom_price_range
0 0 US This tremendous 100% varietal wine hails from ... Martha's Vineyard 96 235.0 California Napa Valley Napa Cabernet Sauvignon Heitz (157.067, 310.133] (94.0, 96.0] >500
1 1 Spain Ripe aromas of fig, blackberry and cassis are ... Carodorum Selección Especial Reserva 96 110.0 Northern Spain Toro NaN Tinta de Toro Bodega Carmen Rodríguez (1.704, 157.067] (94.0, 96.0] 100-200
2 2 US Mac Watson honors the memory of a wine once ma... Special Selected Late Harvest 96 90.0 California Knights Valley Sonoma Sauvignon Blanc Macauley (1.704, 157.067] (94.0, 96.0] 50-100
3 3 US This spent 20 months in 30% new French oak, an... Reserve 96 65.0 Oregon Willamette Valley Willamette Valley Pinot Noir Ponzi (1.704, 157.067] (94.0, 96.0] 50-100
4 4 France This is the top wine from La Bégude, named aft... La Brûlade 95 66.0 Provence Bandol NaN Provence red blend Domaine de la Bégude (1.704, 157.067] (94.0, 96.0] 50-100